\chapter{Effcient Global Optimization}\label{uq:ego}

Efficient Global Optimization (EGO) was developed to facilitate the 
unconstrained minimization of expensive implicit response functions.
The method builds an initial Gaussian process model as a global surrogate 
for the response function, then intelligently selects additional samples 
to be added for inclusion in a new Gaussian process model in subsequent 
iterations. The new samples are selected based on how much they are expected 
to improve the current best solution to the optimization problem.
When this expected improvement is acceptably small, the globally optimal 
solution has been found. The application of this methodology to 
equality-constrained reliability analysis is the primary contribution of 
EGRA.  

Efficient global optimization was originally proposed by
Jones et al.~\cite{Jon98}~and has been adapted into similar methods
such as sequential kriging optimization (SKO)~\cite{Hua06}.
The main difference between SKO and EGO lies within the specific formulation
of what is known as the expected improvement function (EIF), which is the
feature that sets all EGO/SKO-type methods apart from other global optimization
methods.
The EIF is used to select the location at which a new training point should be
added to the Gaussian process model by maximizing the amount of improvement in 
the objective function that can be expected by adding that point.
A point could be expected to produce an improvement in the objective function
if its predicted value is better than the current best solution, or if the
uncertainty in its prediction is such that the probability of it producing
a better solution is high.
Because the uncertainty is higher in regions of the design space with fewer
observations, this provides a balance between exploiting areas of the
design space that predict good solutions, and exploring areas where more
information is needed.

The general procedure of these EGO-type methods is:
\begin{enumerate}
\item Build an initial Gaussian process model of the objective function.
%\item Use cross validation to ensure that the kriging model is satisfactory.
\item Find the point that maximizes the EIF.
      If the EIF value at this point is sufficiently small, stop.
\item Evaluate the objective function at the point where the EIF is maximized.
      Update the Gaussian process model using this new point.
      Go to Step 2.
\end{enumerate}
%\noindent To construct a parallel algorithm, the $n$ best points could be 
%selected and evaluated in steps 2 and 3.
\noindent 

The following sections discuss the construction of the Gaussian process model
used, the form of the EIF, and then a description of how that EIF is modified
for application to reliability analysis.

\section{Gaussian Process Model}\label{uq:ego:gpm}

Gaussian process (GP) models are set apart from other surrogate models because
they provide not just a predicted value at an unsampled point, but also and
estimate of the prediction variance.
This variance gives an indication of the uncertainty in the GP model, which
results from the construction of the covariance function. 
This function is based on the idea that when input points are near one another,
the correlation between their corresponding outputs will be high.
As a result, the uncertainty associated with the model's predictions will be
small for input points which are near the points used to train the model,
and will increase as one moves further from the training points.

It is assumed that the true response function being modeled $G({\bf u})$ can 
be described by:~\cite{Cre91}
\begin{equation}
G({\bf u})={\bf h}({\bf u})^T{\boldsymbol \beta} + Z({\bf u})
\end{equation}
\noindent where ${\bf h}()$ is the trend of the model, 
${\boldsymbol \beta}$ is the vector of trend coefficients, and
$Z()$ is a stationary Gaussian process with zero mean (and covariance defined 
below) that describes the departure of the model from its underlying trend.
The trend of the model can be assumed to be any function, but
taking it to be a constant value has been reported to be generally sufficient.~\cite{Sac89}
For the work presented here, the trend is assumed constant and
${\boldsymbol \beta}$ is taken as simply the mean of the responses at
the training points.
The covariance between outputs of the Gaussian process $Z()$ at points 
${\bf a}$ and ${\bf b}$ is defined as:
\begin{equation}
Cov \left[ Z({\bf a}),Z({\bf b}) \right] = \sigma_Z^2 R({\bf a},{\bf b})
\label{eq:cov}
\end{equation}
\noindent where $\sigma_Z^2$ is the process variance and $R()$ is the
correlation function.
There are several options for the correlation function, but the 
squared-exponential function is common~\cite{Sac89}, and is used here for $R()$:
\begin{equation}
R({\bf a},{\bf b}) = \exp \left[ -\sum_{i=1}^d \theta_i(a_i - b_i)^2 \right]
\end{equation}
\noindent where $d$ represents the dimensionality of the problem
(the number of random variables), and 
$\theta_i$ is a scale parameter that indicates the correlation between the 
points within dimension $i$.
A large $\theta_i$ is representative of a short correlation length.

The expected value $\mu_G()$ and variance $\sigma_G^2()$ of the GP model 
prediction at point ${\bf u}$ are:
\begin{align}
\mu_G({\bf u}) &= {\bf h}({\bf u})^T{\boldsymbol \beta} + 
  {\bf r}({\bf u})^T{\bf R}^{-1}({\bf g} - {\bf F}{\boldsymbol \beta}) 
  \label{eq:exp} \\
\sigma_G^2({\bf u}) &= \sigma_Z^2 - 
  \begin{bmatrix} {\bf h}({\bf u})^T  & 
                  {\bf r}({\bf u})^T  \end{bmatrix}
  \begin{bmatrix} {\bf 0} & {\bf F}^T \\ 
                  {\bf F} & {\bf R}   \end{bmatrix}^{-1}
  \begin{bmatrix} {\bf h}({\bf u})    \\ 
                  {\bf r}({\bf u})    \end{bmatrix} \label{eq:var}
\end{align}
where ${\bf r}({\bf u})$ is a vector containing the covariance between 
${\bf u}$ and each of the $n$ training points (defined by Eq.~\ref{eq:cov}),
${\bf R}$ is an $n \times n$ matrix containing the correlation between each
pair of training points,
${\bf g}$ is the vector of response outputs at each of the training points, and
${\bf F}$ is an $n \times q$ matrix with rows ${\bf h}({\bf u}_i)^T$ (the
trend function for training point $i$ containing $q$ terms; for a constant
trend $q\!=\!1$).
This form of the variance accounts for the uncertainty in the trend 
coefficients $\boldsymbol \beta$, but assumes that the parameters governing
the covariance function ($\sigma_Z^2$ and $\boldsymbol \theta$) have known 
values.

The parameters $\sigma_Z^2$ and ${\boldsymbol \theta}$ are determined through 
maximum likelihood estimation.
This involves taking the log of the probability of observing the response 
values ${\bf g}$ given the covariance matrix ${\bf R}$, which can be written 
as:~\cite{Sac89}
\begin{equation}
\log \left[ p({\bf g} | {\bf R}) \right] = 
  -\frac{1}{n} \log \lvert{\bf R}\rvert - \log(\hat{\sigma}_Z^2) 
  \label{eq:like}
\end{equation}
\noindent where $\lvert {\bf R} \rvert$ indicates the determinant of ${\bf R}$,
and $\hat{\sigma}_Z^2$ is the optimal value of the variance given an estimate
of $\boldsymbol \theta$ and is defined by:
\begin{equation}
\hat{\sigma}_Z^2 = \frac{1}{n}({\bf g}-{\bf F}{\boldsymbol \beta})^T
  {\bf R}^{-1}({\bf g}-{\bf F}{\boldsymbol \beta})
\end{equation}
%\noindent where $\hat{\boldsymbol \beta}$ is the generalized least squares 
%estimate of $\boldsymbol \beta$ from:
%\begin{equation}
%\hat{\boldsymbol \beta} = \left[ {\bf F}^T{\bf R}^{-1}{\bf F} \right]^{-1}
%  {\bf F}^T{\bf R}^{-1}{\bf g}
%\end{equation}
\noindent Maximizing Eq.~\ref{eq:like} gives the maximum likelihood estimate 
of $\boldsymbol \theta$, which in turn defines $\sigma_Z^2$.

\section{Acquisition Functions}\label{uq:ego:acq}

The acquisition function determines the location of the next sampling point or refinement points, in the sense that maximizing the acquisition function yields the next sampling point, as
\begin{equation}
{\bf u}^* = \argmax_{\bf u} a({\bf u}).
\end{equation}

\subsection{Expected Improvement}\label{uq:ego:acq:eif}

The expected improvement function is used to select the location at which a 
new training point should be added.
The EIF is defined as the expectation that any point in the search
space will provide a better solution than the current best solution
based on the expected values and variances predicted by the GP model.
An important feature of the EIF is that it provides a balance between 
exploiting areas of the design space where good solutions have been found, and 
exploring areas of the design space where the uncertainty is high.
First, recognize that at any point in the design space, the GP prediction
$\hat{G}()$ is a Gaussian distribution:
\begin{equation}
\hat{G}({\bf u}) \sim \mathcal{N}\left( \mu_G({\bf u}), \sigma_G({\bf u}) \right)
\end{equation}
\noindent where the mean $\mu_G()$ and the variance $\sigma_G^2()$ were 
defined in Eqs.~\ref{eq:exp} and \ref{eq:var}, respectively.
The EIF is defined as:~\cite{Jon98}
\begin{equation}
EI\bigl( \hat{G}({\bf u}) \bigr) \equiv 
  E\left[ \max \left( G({\bf u}^*) - \hat{G}({\bf u}),0 \right) \right]  
\end{equation}
\noindent where $G({\bf u}^*)$ is the current best solution chosen from among 
the true function values at the training points (henceforth referred to as
simply $G^*$).
This expectation can then be computed by integrating over the distribution 
$\hat{G}({\bf u})$ with $G^*$ held constant:
\begin{equation}
EI\bigl( \hat{G}({\bf u}) \bigr) = 
  \int_{-\infty}^{G^*} \left( G^* - G \right) \, \hat{G}({\bf u}) \; dG  
  \label{eq:eif_int}
\end{equation}
\noindent where $G$ is a realization of $\hat{G}$.
This integral can be expressed analytically as:~\cite{Jon98}
\begin{equation}
EI\bigl( \hat{G}({\bf u}) \bigr) = \left( G^* - \mu_G \right) \,
  \Phi\left( \frac{G^* - \mu_G}{\sigma_G} \right) + \sigma_G \,
  \phi\left( \frac{G^* - \mu_G}{\sigma_G} \right) \label{eq:eif}
\end{equation}
\noindent where it is understood that $\mu_G$ and $\sigma_G$ are functions of 
${\bf u}$.
Rewritting in a more compact manner and dropping the subscript $_G$, 
\begin{equation}
\label{eq:eifShort}
a_\text{EI}({\bf u}, \{{\bf u}_i,y_i \}_{i=1}^N,\theta)) = \sigma({\bf u}) \cdot( \gamma({\bf u}) \Phi(\gamma({\bf u}) ) + \phi(\gamma({\bf u})) ),
\end{equation}
where $\gamma({\bf u}) = \frac{G^* - \mu({\bf u})}{\sigma({\bf u})}$. This equation defines the expected improvement acquisition function for an unknown ${\bf u}$. 

The point at which the EIF is maximized is selected as an additional training 
point.
With the new training point added, a new GP model is built and then used 
to construct another EIF, which is then used to choose another new training 
point, and so on, until the value of the EIF at its maximized point is below 
some specified tolerance.
In Ref.~\cite{Hua06} this maximization is performed using a Nelder-Mead
simplex approach, which is a local optimization method.
Because the EIF is often highly multimodal~\cite{Jon98} it is expected that 
Nelder-Mead may fail to converge to the true global optimum.
In Ref.~\cite{Jon98}, a branch-and-bound technique for maximizing the EIF
is used, but was found to often be too expensive to run to convergence.
In Dakota, an implementation of the DIRECT global optimization algorithm is 
used~\cite{Gab01}.

It is important to understand how the use of this EIF leads to optimal
solutions.
Eq.~\ref{eq:eif} indicates how much the objective function value at ${\bf x}$ 
is expected to be less than the predicted value at the current best solution. 
Because the GP model provides a Gaussian distribution at each predicted 
point, expectations can be calculated.
Points with good expected values and even a small variance will
have a significant expectation of producing a better solution (exploitation), 
but so will points that have relatively poor expected values and greater 
variance (exploration).

The application of EGO to reliability analysis, however, is made more 
complicated due to the inclusion of equality constraints 
(see Eqs.~\ref{eq:ria_opt}-\ref{eq:pma_opt}).
For inverse reliability analysis, this extra complication is small.
The response being modeled by the GP is the objective function of the optimization 
problem (see Eq.~\ref{eq:pma_opt}) and the deterministic constraint might be handled 
through the use of a merit function, thereby allowing EGO to solve this 
equality-constrained optimization problem.
Here the problem lies in the interpretation of the constraint for multimodal
problems as mentioned previously.
In the forward reliability case, the response function appears in the
constraint rather than the objective.
Here, the maximization of the EIF is inappropriate because feasibility is
the main concern.
This application is therefore a significant departure from the original
objective of EGO and requires a new formulation.
For this problem, the expected feasibility function is introduced.


\subsection{Probability Improvement Acquisition Function}\label{uq:ego:acq:pi}

The probability of improvement (PI) acquisition function is proposed by \cite{kushner1964new}, using the same argument that the GP prediction is a Gaussian distribution. 
Similar to Equation \ref{eq:eifShort}, the PI acquisition function is given by
\begin{equation}
a_{\text{PI}}({\bf u}) = \Phi(\gamma({\bf u})).
\end{equation}
% where $\gamma({\bf u}) = \frac{G^* - \mu({\bf u})}{\sigma({\bf u})} $. 
Generally speaking, the EI acquisition function performs better than the PI acquisition function. 

\subsection{Lower-Confidence Bound Acquisition Function} \label{uq:ego:acq:lcb}

Another form of acquisition is lower-confidence bound (LCB), proposed recently by Srinivas et al. \cite{srinivas2009gaussian,srinivas2012information}, which has shown to perform very well. 
The LCB acquisition function takes the form of
\begin{equation}
a_{\text{LCB}}({\bf u}) = - \mu({\bf u}) + \kappa \sigma({\bf u}),
\end{equation}
where $\kappa$ is a hyper-parameter describing the acquisition exploitation-exploration balance. In many cases in design optimization, $\kappa = 2$ is preferred, but relaxing this $\kappa$ as a function of iterations is also possible, cf. Daniel et al. \cite{daniel2014active}, as
\begin{equation}
\kappa = \sqrt{\nu \gamma_n},\quad \nu = 1, \quad \gamma_n = 2\log{\left(\frac{N^{d/2 + 2}\pi^2}{3\delta} \right)},
\end{equation}
and $d$ is the dimensionality of the problem, and $\delta \in (0,1)$ \cite{srinivas2012information}. 

\section{Batch-sequential parallel}

The batch-sequential parallelization is mainly motivated by exploiting the computational resource, where multiple sampling point ${\bf u}$ can be queried concurrently on a high-performance computing platform. 
The benefit of batch implementation is that the physical time to converge to the optimal solution is significantly reduced with a factor of $\sqrt{K}$, where $K$ is the batch size. 
While there are many flavors of batch-sequential parallelization, as well as asynchronous parallelization in EGO and Bayesian optimization, we mainly review the theory of GP-BUCB by Desautels et al. \cite{desautels2014parallelizing}, GP-UCB-PE by Contal et al \cite{contal2013parallel}, and pBO-2GP-3B by Tran et al \cite{tran2019pbo}. 
The parallelization feature of EGO is sometimes referred to as lookahead or non-myopic Bayesian optimization in the literature, especially in the machine learning community. 

The approach by Desautels et al. \cite{desautels2014parallelizing} mainly advocates for the ``hallucination'' scheme or heuristic liar, in which the unknown observation at the currently querying sampling point ${\bf u}^*$ is \emph{temporarily} assumed as the posterior mean $\mu({\bf u}^*)$. 
Then, the underlying GP model updates based on this assumption and locates other points in the same batch, until the batch is filled. 
After the whole batch is constructed, it is then queried, and all the responses are received at once when the batch is completed. 
Contal et al. \cite{contal2013parallel} extended from the work of Desautels et al. \cite{desautels2014parallelizing} and proved that including pure exploration (i.e. sampling at ${\bf u}^*$ where $\sigma({\bf u})$ is maximum) increases the efficiency. Tran et al. \cite{tran2019pbo} adopted two aforementioned approaches and extended for known and unknown constraints. 

\section{Asynchronous batch parallel}

The asynchronous batch parallel EGO is implemented based on the idea of further leveraging computational efficiency when the computational query cost varies widely.
In this scenario, the batch-sequential parallel EGO finishes one iteration when the last worker of the batch finishes.
This mechanism makes the other workers, which might have finished the jobs or simulations earlier, wait for the last worker to finish, thus creating an unnecessary idle period.        
The asynchronous batch parallel scheme is, therefore, created to accelerate the optimization process by immediately assigning the next jobs to workers that have finished earlier jobs, without waiting for each other.
When workers finish one query, the objective GP is updated, and the next sampling point is found by maximizing the acquisition function. 
Numerical comparison results are shown in one of our previous works \cite{tran2022aphbo}, across a number of numerical functions and some engineering simulations as well. 



